An index for regular expression queries: Design and implementation

نویسندگان

  • Dominic Tsang
  • Sanjay Chawla
چکیده

The like regular expression predicate has been part of the SQL standard since at least 1989. However, despite its popularity and wide usage, database vendors provide only limited indexing support for regular expression queries which almost always require a full table scan. In this paper we propose a rigorous and robust approach for providing indexing support for regular expression queries. Our approach consists of formulating the indexing problem as a combinatorial optimization problem. We begin with a database, abstracted as a collection of strings. From this data set we generate a query workload. The input to the optimization problem is the database and the workload. The output is a set of multigrams (substrings) which can be used as keys to records which satisfy the query workload. The multigrams can then be integrated with the data structure (like B+ trees) to provide indexing support for the queries. We provide a deterministic and a randomized approximation algorithm (with provable guarantees) to solve the optimization problem. Extensive experiments on synthetic data sets demonstrate that our approach is accurate and efficient. We also present a case study on PROSITE patterns which are complex regular expression signatures for classes of proteins. Again, we are able to demonstrate the utility of our indexing approach in terms of accuracy and efficiency. Thus, perhaps for the first time, there is a robust and practical indexing mechanism for an important class of database queries. ∗A short version of this paper is published in the 20th ACM Conference on Information and Knowledge Management (CIKM2011) at at http://www.cikm2011.org/ and ACM DL

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Optimization of Regular Expression Evaluation within the Manatee Corpus Management System

This paper is concerned with searching large text corpora – electronic collections of texts. Often these are subject to queries specified by means of regular expressions. Such queries go beyond a simple keyword search that can be quickly evaluated using an inverted index, usually they are rather processed by third-party regular expression libraries and take significantly more time to evaluate. ...

متن کامل

بهبود الگوریتم انتخاب دید در پایگاه داده‌‌ تحلیلی با استفاده از یافتن پرس‌ وجوهای پرتکرار

A data warehouse is a source for storing historical data to support decision making. Usually analytic queries take much time. To solve response time problem it should be materialized some views to answer all queries in minimum response time. There are many solutions for view selection problems. The most appropriate solution for view selection is materializing frequent queries. Previously posed ...

متن کامل

Extending the Qualitative Trajectory Calculus Based on the Concept of Accessibility of Moving Objects in the Paths

Qualitative spatial representation and reasoning are among the important capabilities in intelligent geospatial information system development. Although a large contribution to the study of moving objects has been attributed to the quantitative use and analysis of data, such calculations are ineffective when there is little inaccurate data on position and geometry or when explicitly explaining ...

متن کامل

A New Design for a Native XML Storage and Indexing Manager

This paper describes the design and implementation of an XML storage manager for fast and interactive XPath expressions evaluation. This storage manager has two main parts: the XML data storage structure and the index over this data. The system is designed in such a way that it minimizes the number of page reads for retrieving any XPath expression results while avoiding the shortcomings of prev...

متن کامل

Indexing for String Queries using

We present ve index trees designed for supporting string searches. We discuss Counted Trees, Substring Trees, and Regular Expression Trees, all of which suuer the same problem { their attempts at approximating a large set of data lead to an almost complete lack of information in the interior nodes. A variant of B-Trees, called a Preex-Suux Tree, avoids this diiculty by severely restricting the ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1108.1228  شماره 

صفحات  -

تاریخ انتشار 2011